Skip to content

Artem/fix gpu count#379

Open
vertex451 wants to merge 8 commits into
mainfrom
artem/fix-gpu-count
Open

Artem/fix gpu count#379
vertex451 wants to merge 8 commits into
mainfrom
artem/fix-gpu-count

Conversation

@vertex451

@vertex451 vertex451 commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Resolves:

akash-network/support#429

Details

I haven't found a bug in how do we handle inventory events.

Also, I've found recent report about GPU driver reporting wrong values.

That is why I decided to add a guardrails that prevents uint underflow, also I added logging, so we can track this in the future.

Changes

  • Safe int64→uint64 conversion - clamp negatives to 0 instead of raw cast; guard against -1 sentinel in Available() returning MaxInt64
  • Sanitize all resource quantities at ingest - reject negative values from knode.Status.Allocatable/Capacity for CPU, Memory, GPU, Storage
  • GPU allocatable upper-bound - clamp to Capacity if Allocatable > Capacity
  • Pod key uses namespace/name - prevents cross-namespace accounting collision
  • Handle watch.Added on node watcher reconnect - was silently dropped, leaving stale allocatable
  • podsWatch.Stop() on restart - old code drained one event and leaked the watcher
  • Precision-preserving Sub() - old Value() - Value() truncated milli-precision, drifting CPU allocated toward 0

Testing

Tested locally at non-gpu machine and checked if CPU is consistently changed compared to the main branch.
Checked CPU and memory changes at each stages:
Create a provider -> Create a deployment -> Create a lease -> Send a manifest -> Delete a deployment.

@coderabbitai

coderabbitai Bot commented Mar 23, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Context is threaded into inventory objects; resource quantities are sanitized and clamped to zero when negative or underflowing; pod tracking keys become namespace-aware; watchers restart cleanly and node events filter by name; tests verify clamping, underflow behavior, and allocatable change detection across both inventory metrics and node discovery.

Changes

Resource Quantity Sanitization and Context Threading

Layer / File(s) Summary
Inventory context threading and clamping setup
cluster/kube/operators/clients/inventory/client.go, cluster/kube/operators/clients/inventory/inventory.go
Added ctx context.Context field to inventory struct; subscriber flow constructs outgoing inventory messages with request context; newInventory accepts and stores context; introduced clampUint64 and clampAvailableUint64 helpers that log and clamp negative quantities to zero; dup() simplified to preserve context.
Inventory metrics computation with clamping
cluster/kube/operators/clients/inventory/inventory.go, cluster/kube/operators/clients/inventory/client_test.go
Metrics() derives logger from stored context and applies clamping helpers to CPU, GPU, Memory, and Ephemeral Storage totals, plus per-storage-class allocatable and available values, replacing unchecked conversions; test context wraps logger with logr.Discard(); new table-driven TestInventoryMetrics_negative_quantities_clamped streams clusters with mutated negative quantities and asserts metrics clamp all affected fields to zero, including per-storage-class and GPU available behavior.
Node resource sanitization
operator/inventory/node-discovery.go, operator/inventory/node-discovery_test.go
Added logr import; introduced sanitizeResourceQuantity helper that clamps negative node-reported values to zero with error logs; TestSanitizeResourceQuantity verifies sanitization returns zero for negative and zero quantities across resource types.
Node info population and allocatable change detection
operator/inventory/node-discovery.go, operator/inventory/node-discovery_test.go
updateNodeInfo accepts context, zero-resets allocatable and capacity fields before repopulating from node data, sanitizes negative quantities with logs, and enforces GPU allocatable ≤ GPU capacity; initNodeInfo calls refactored updateNodeInfo with context; allocatable change detection now uses resource.Quantity.Cmp() instead of raw value comparison; TestNodeAllocatableChanged_detectsMilliCPUChange verifies detection works across milli-CPU values.
Pod tracking with namespace-aware keying
operator/inventory/node-discovery.go
New podKey() helper constructs namespace/name composite keys; pod map initialization and all add/modify/delete event handlers updated to use namespace-aware keys, preventing cross-namespace collisions; restartPodsWatcher stops and nils existing watcher instead of draining; node watch handling now filters events by node name, handles both Added and Modified events, conditionally calls updateNodeInfo with context, and deep-copies node after processing.
Pod resource underflow accounting with logging
operator/inventory/node-discovery.go, operator/inventory/node-discovery_test.go
subAllocatedNLZ now accepts logger and node/resource identity, performs subtraction, detects underflow via comparison to zero, clamps to zero, and logs before/subtracted quantities; pod resource subtraction threads logging through all resource paths including emptyDir memory with dedicated resource identifier; TestSubAllocatedNLZ_underflow_clamped verifies underflow on subtraction is clamped to zero.

Sequence Diagram

sequenceDiagram
  participant NodeWatcher
  participant updateNodeInfo
  participant sanitizeResourceQuantity
  participant Metrics
  participant clampUint64
  NodeWatcher->>updateNodeInfo: ctx, node data
  updateNodeInfo->>updateNodeInfo: zero-reset fields
  updateNodeInfo->>sanitizeResourceQuantity: node allocatable
  alt negative value
    sanitizeResourceQuantity->>sanitizeResourceQuantity: log error
    sanitizeResourceQuantity->>updateNodeInfo: return 0
  else valid
    sanitizeResourceQuantity->>updateNodeInfo: return clamped
  end
  updateNodeInfo->>updateNodeInfo: store sanitized values
  NodeWatcher->>Metrics: compute inventory metrics
  Metrics->>clampUint64: allocatable total
  clampUint64->>Metrics: return clamped value
  Metrics->>Metrics: return metric set
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • troian

Poem

🐇 I hop through context, steady and spry,
I clip negatives down to a friendly 0, why —
Pods now wear namespaced shoes that fit,
Watchers restart clean; no more misfit! 🌿✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Artem/fix gpu count' is vague and does not clearly summarize the main changes; the PR addresses resource quantity sanitization across multiple resource types (CPU, Memory, GPU, Storage) and pod tracking improvements, not just GPU counting. Consider a more descriptive title such as 'Sanitize resource quantities and fix pod accounting' that better reflects the comprehensive nature of the defensive handling changes.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The description is well-related to the changeset, providing detailed context about the fixes including uint underflow prevention, resource sanitization, pod key improvements, watch event handling, and precision preservation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch artem/fix-gpu-count

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread operator/inventory/node-discovery.go
Comment thread operator/inventory/node-discovery.go
Comment thread operator/inventory/node-discovery.go Outdated
Comment thread operator/inventory/node-discovery.go Outdated
@vertex451 vertex451 marked this pull request as ready for review March 25, 2026 13:45
@vertex451 vertex451 requested a review from a team as a code owner March 25, 2026 13:45
@vertex451 vertex451 self-assigned this Mar 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
operator/inventory/node-discovery.go (1)

541-546: ⚠️ Potential issue | 🟠 Major

Refresh node quantities on every Added/Modified event.

This guard skips updateNodeInfo() for capacity-only updates, and it also misses sub-core CPU changes because nodeAllocatableChanged() relies on resource.Quantity.Value(). That leaves stale allocatable/capacity data in node, so the new sanitization/clamping path is bypassed on a normal node update. restartPodsWatcher() can stay behind change detection if you want, but updateNodeInfo() itself should run for every watch.Added/watch.Modified.

🛠️ One safe way to split the refresh from the watcher restart
 				switch evt.Type {
 				case watch.Added:
 					fallthrough
 				case watch.Modified:
-					if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
-						updateNodeInfo(ctx, obj, &node)
+					updateNodeInfo(ctx, obj, &node)
+					if evt.Type == watch.Added || (knode != nil && nodeAllocatableChanged(knode, obj)) {
 						if err = restartPodsWatcher(); err != nil {
 							return err
 						}
 					}
 					signalLabels()
In `k8s.io/apimachinery/pkg/api/resource`, does `Quantity.Value()` preserve milli-CPU precision, or does a quantity like `500m` become `0`? What comparison should be used to detect CPU allocatable changes without truncation?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@operator/inventory/node-discovery.go` around lines 541 - 546, Always call
updateNodeInfo(...) for every watch.Added or watch.Modified event instead of
skipping it for capacity-only updates; keep nodeAllocatableChanged(...) solely
to decide whether to call restartPodsWatcher(), not to gate the update. Modify
the event handling in the Added/Modified case so updateNodeInfo(ctx, obj, &node)
is invoked unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cluster/kube/operators/clients/inventory/client_test.go`:
- Around line 406-408: The current test only compares
m.TotalAllocatable.Storage["default"] == 0 which false-passes if the "default"
key is missing; update the check in the anonymous check func to first assert the
map contains the "default" key (e.g. require.Contains/ok :=
m.TotalAllocatable.Storage["default"]) and then assert the value equals
uint64(0), referencing the existing m.TotalAllocatable.Storage map and the
"default" key.

In `@operator/inventory/node-discovery.go`:
- Around line 718-724: The current guard skips clamping when sanitized
gpuCapacity is 0, so change the logic in the block using gpuAllocatable and
gpuCapacity (variables gpuAllocatable, gpuCapacity and the
node.Resources.GPU.Quantity.Allocatable/Capacity fields) to clamp whenever
gpuAllocatable > gpuCapacity (remove the gpuCapacity > 0 check), call
node.Resources.GPU.Quantity.Allocatable.Set(gpuCapacity) and emit the same
log.Error with errGPUExceedsCapacity and context ("node", knode.Name,
"allocatable", gpuAllocatable, "capacity", gpuCapacity) so nodes with a
normalized capacity of 0 are correctly clamped.
- Around line 685-717: updateNodeInfo currently only updates fields present in
the incoming Allocatable/Capacity maps, leaving stale values for resources that
disappeared; before the two for-loops, explicitly reset the fields this function
owns: set node.Resources.CPU.Quantity.Allocatable.SetMilli(0) and
.Capacity.SetMilli(0), node.Resources.Memory.Quantity.Allocatable.Set(0) and
.Capacity.Set(0), node.Resources.EphemeralStorage.Quantity.Allocatable.Set(0)
and .Capacity.Set(0) (or .EphemeralStorage if that field name differs), and
node.Resources.GPU.Quantity.Allocatable.Set(0) and .Capacity.Set(0); then run
the existing loops to repopulate only current values so removed extended
resources won’t leave stale non-zero entries.

---

Duplicate comments:
In `@operator/inventory/node-discovery.go`:
- Around line 541-546: Always call updateNodeInfo(...) for every watch.Added or
watch.Modified event instead of skipping it for capacity-only updates; keep
nodeAllocatableChanged(...) solely to decide whether to call
restartPodsWatcher(), not to gate the update. Modify the event handling in the
Added/Modified case so updateNodeInfo(ctx, obj, &node) is invoked
unconditionally when evt.Type is Added or Modified, and only if
nodeAllocatableChanged(knode, obj) returns true then call restartPodsWatcher().
Update nodeAllocatableChanged(...) to compare CPU using Quantity.MilliValue()
(or Quantity.AsDec/Compare if using decimals) to avoid truncation of sub-core
(milli) CPU values and use proper Quantity.Cmp/Equal semantics for other
resources.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0116d940-8636-4609-8a76-8b97ce343688

📥 Commits

Reviewing files that changed from the base of the PR and between ba474df and b534eb1.

📒 Files selected for processing (5)
  • cluster/kube/operators/clients/inventory/client.go
  • cluster/kube/operators/clients/inventory/client_test.go
  • cluster/kube/operators/clients/inventory/inventory.go
  • operator/inventory/node-discovery.go
  • operator/inventory/node-discovery_test.go

Comment thread cluster/kube/operators/clients/inventory/client_test.go
Comment thread operator/inventory/node-discovery.go
Comment thread operator/inventory/node-discovery.go
Comment thread operator/inventory/node-discovery.go Outdated
Comment thread operator/inventory/node-discovery.go Outdated
Comment thread operator/inventory/node-discovery.go
@chalabi2 chalabi2 force-pushed the artem/fix-gpu-count branch from 45b5af0 to 7edb230 Compare June 20, 2026 21:27
chalabi2 added 2 commits June 20, 2026 15:08
Node modified events must refresh inventory state even when pod watcher restart is not required. The restart predicate now uses quantity comparison so milli-CPU allocatable changes are not missed.

Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>
Use Kubernetes namespaced names for pod accounting keys and simplify node watch event cases. Clarify why node status refresh resets only the fields it owns.

Signed-off-by: Joseph Chalabi <chalabi.joseph@gmail.com>
@chalabi2 chalabi2 changed the title Artem/fix gpu count Fix inventory resource quantity handling Jun 21, 2026
@chalabi2 chalabi2 changed the title Fix inventory resource quantity handling Artem/fix gpu count Jun 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants